08. Quiz: Gather (Download)

Gather: Download

The dataset used in this lesson is hosted on the Kaggle Datasets page Armenian Online Job Postings. Some context on this dataset, from the description section of that page:

The online job market is a good indicator of overall demand for labor in an economy. This dataset consists of 19,000 job postings from 2004 to 2015 posted on CareerCenter, an Armenian human resource portal.

Since postings are text documents and tend to have similar structures, text mining can be used to extract features like posting date, job title, company name, job description, salary, and more. Postings that had no structure or were not job-related were removed. The data was originally scraped from a Yahoo! mailing group.

Downloading Files Manually vs. Programmatically

Best Practice: Downloading Files Programmatically

Files on the internet can be downloaded manually by clicking the download button (or sometimes by right-clicking a link and choosing "Save file as"). Best practice, however, is to download files programmatically, i.e. with code, for two reasons: scalability and reproducibility.

  1. Scalability: Imagine you had a thousand files to download from a thousand different web pages, instead of just one. Pointing and clicking a thousand times would take an eternity; a few lines of code can do the same job.

  2. Reproducibility: Someone, whether it's you or another person, is likely going to want to run your analysis later, so make downloading the dataset or datasets as easy on that person as possible. Reproducibility is also one of the main principles of the scientific method. You want to be able to show people that your analysis, visualization, etc. is legitimate. People need to know that, given your data, your computational environment, your code, etc., they can reproduce your results. Plus, the dataset or the web page it lives on may change, so if you record the date you downloaded the dataset, you give future readers a chance to access archived copies of the dataset, or at least to understand why their results are different.
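Both points can be illustrated with a minimal Python sketch. It is not the lesson's own solution: the helper names download_file and download_all are hypothetical, and it assumes each file is reachable at a direct URL that needs no login (a Kaggle dataset page may require signing in, in which case a plain URL request would not be enough). The loop covers scalability, and the recorded download date covers reproducibility.

```python
import os
import shutil
import urllib.request
from datetime import date
from urllib.parse import unquote, urlparse


def download_file(url, dest_dir="."):
    """Download `url` into `dest_dir` and return the path of the saved file.

    Hypothetical helper for illustration; assumes the URL needs no login.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Use the last path segment of the URL as the local file name.
    filename = os.path.basename(unquote(urlparse(url).path)) or "download"
    dest_path = os.path.join(dest_dir, filename)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def download_all(urls, dest_dir="."):
    """Download every URL; return {url: (saved_path, ISO download date)}."""
    log = {}
    for url in urls:
        # Recording the date lets a later reader look for an archived copy.
        log[url] = (download_file(url, dest_dir), date.today().isoformat())
    return log
```

Calling download_all with a list of a thousand URLs takes the same one line as calling it with a single URL, and the returned log can be saved next to the analysis so that others know when each file was fetched.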

For This Walkthrough…

Gather: Download II

Quiz

Quiz: Download

QUESTION:

What is the name of the downloaded file?

SOLUTION:

NOTE: The solutions are expressed as RegEx patterns. Udacity uses these patterns to check the given answer.

Solution

Gather: Download Solution